Locality based quorum eligibility

ABSTRACT

Disclosed are various embodiments for distributing data items. A plurality of nodes forms a distributed data store. A new master candidate is determined through an election among the plurality of nodes. Before performing a failover from a failed master to the new master candidate, a consensus is reached among a locality-based failover quorum of the nodes. The quorum excludes any of the nodes that are in a failover quorum ineligibility mode.

BACKGROUND

A data store, such as, for example, a non-relational database, a relational database management system (RDBMS), or another data system, may be implemented as a distributed system. Distributed systems can offer improved reliability and availability, better fault tolerance, increased performance, and easier expansion. Some distributed models employ single-master replication, where data written to a master data store is replicated to one or more secondary data stores. Distributed data stores may experience difficulties if the master data store fails.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of the present disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.

FIG. 1 is a drawing of a networked environment according to various embodiments of the present disclosure.

FIG. 2 is another view of the networked environment of FIG. 1 according to various embodiments of the present disclosure.

FIG. 3 is a flowchart illustrating an example of functionality implemented as portions of a data store management application executed in a computing device in the networked environment of FIG. 1 according to various embodiments of the present disclosure.

FIG. 4 is a flowchart illustrating another example of functionality implemented as portions of a data store management application executed in a computing device in the networked environment of FIG. 1 according to various embodiments of the present disclosure.

FIG. 5 is a flowchart illustrating yet another example of functionality implemented as portions of a data store management application executed in a computing device in the networked environment of FIG. 1 according to various embodiments of the present disclosure.

FIG. 6 is a flowchart illustrating still another example of functionality implemented as portions of a data store management application executed in a computing device in the networked environment of FIG. 1 according to various embodiments of the present disclosure.

FIG. 7 is a schematic block diagram that provides one example illustration of a computing device employed in the networked environment of FIG. 1 according to various embodiments of the present disclosure.

DETAILED DESCRIPTION

The present disclosure relates to failover recovery in a distributed data store. In one embodiment, a distributed data store can employ a single-master replication model that provides for a master data store and one or more slave data stores. The master data store can receive, from client systems, updates to data items stored in the distributed data store and propagate the updates to the slave data stores. Upon propagating an update to a requisite number of slave data stores, the master data store can then consider the update successful, durable, and/or committed to the distributed data store. To provide data durability or integrity from a client or user point of view, any update to a data item acknowledged to the user as successful in a distributed data store according to embodiments of the disclosure should be able to survive the failure of the master data store. In such a scenario, a slave data store in the distributed data store can be designated as the new master data store.

To provide such failover capability to the distributed data store, the new master data store, previously a slave data store, must be able to determine at least the last successful updates committed to the distributed data store and acknowledged as successful to a client in order to properly assume its role as the master. Before switching over to a new master, a consensus must be reached among a failover quorum. Having seen full participation from the appropriate failover quorum, the newly elected master is guaranteed to know about all of the updates that have gained locality-based durability. Excluding nodes from the failover quorum using the techniques described herein allows the failover to proceed when it might otherwise block for an arbitrary amount of time waiting for a particular node to take part in the quorum. This blocking might occur, for example, due to a node having failed or having been taken down temporarily for maintenance.

With reference to FIG. 1, shown is a networked environment 100 according to various embodiments. The networked environment 100 includes one or more computing devices 103a . . . 103N in data communication with one or more client devices 106 by way of a network 109. The network 109 includes, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks.

Each of the computing devices 103a . . . 103N may comprise, for example, a server computer or any other system providing computing capability. Alternatively, a plurality of computing devices 103a . . . 103N may be employed that are arranged, for example, in one or more server banks or computer banks or other arrangements. A plurality of computing devices 103a . . . 103N together may comprise, for example, a cloud computing resource, a grid computing resource, and/or any other distributed computing arrangement. Such computing devices 103a . . . 103N may be located in a single installation or may be distributed among many different geographical locations. For purposes of convenience, the computing device 103 is referred to herein in the singular. Even though the computing device 103 is referred to in the singular, it is understood that a plurality of computing devices 103a . . . 103N may be employed in the various arrangements as described above.

Various applications and/or other functionality may be executed in the computing device 103 according to various embodiments. Also, various data is stored in a respective data store 112a . . . 112N that is accessible to the computing device 103. The respective data store 112a . . . 112N may be representative of a plurality of data stores as can be appreciated. The data stored in the data store 112, for example, is associated with the operation of the various applications and/or functional entities described below. The data stored in a data store 112 includes, for example, replicated data 115 and potentially other data. The replicated data 115 includes any data maintained in the data store 112 that can be durably persisted across a distributed data store implemented by the various computing devices 103 in the system.

The components executed on the computing device 103, for example, include a data store management application 118, a context-ordered group delivery service 119, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein. The context-ordered group delivery service 119 guarantees that every node processes a particular message by first processing every other message which the particular message depends on. In other words, the context-ordered group delivery service 119 guarantees that the message context graph is always a directed acyclical graph.
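
By way of illustration, the following minimal sketch in Python shows one way a delivery service might hold back a message until its entire context has been processed. The Message and ContextOrderedDelivery names and the dependency-set representation are illustrative assumptions, not the implementation of the context-ordered group delivery service 119.

```python
# Sketch only: a receiver that processes a message only after every
# message in its context (depends_on) has been processed.
from dataclasses import dataclass, field

@dataclass
class Message:
    msg_id: str
    depends_on: set = field(default_factory=set)  # the message's context
    payload: object = None

class ContextOrderedDelivery:
    def __init__(self, process):
        self.process = process    # application callback, e.g., apply a replicated write
        self.processed = set()    # IDs of messages already handed to the application
        self.pending = []         # messages still waiting on unprocessed context

    def deliver(self, msg):
        """Buffer msg until every message in its context has been processed."""
        self.pending.append(msg)
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                if msg.depends_on <= self.processed:   # whole context seen
                    self.process(msg)
                    self.processed.add(msg.msg_id)
                    self.pending.remove(msg)
                    progress = True

# Usage: "b" depends on "a", so it is held until "a" has been processed.
cod = ContextOrderedDelivery(process=lambda m: print("processing", m.msg_id))
cod.deliver(Message("b", depends_on={"a"}))   # buffered
cod.deliver(Message("a"))                     # processes "a", then "b"
```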

When a computing device 103 is designated as a master data store for a distributed data store implemented by computing devices 103a . . . 103N, the data store management application 118 takes on a master role and is thus executed to manage the data store 112 and to facilitate replication of data to one or more data stores 112 accessible to computing devices 103 that are designated as slave data stores. In a master role, the data store management application 118 may obtain data item update requests 121 from the client device 106 and respond with data item update confirmations 124. The updates may take the form of writes to the data store 112, for example. The master data store management application 118 may also generate and send data item replication requests to the slave data store management applications 118 and obtain data item replication confirmations from the slave data store management applications 118.

When a computing device 103 is designated as a slave data store for a distributed data store implemented by computing devices 103a . . . 103N, the data store management application 118 takes on a slave role and is thus executed to receive data item replication requests from a master data store management application 118 and cause the corresponding data item to be stored in the data store 112 managed by the slave data store management application 118. In other words, the slave data store management applications 118 are each configured to obtain data item replication requests from the master data store management application 118. In response to the data item replication requests, the slave data store management application 118 is configured to commit data item updates to its respective data store 112a . . . 112N and then generate and send data item replication confirmations to the master data store management application 118.

The client device 106 is representative of a plurality of client devices that may be coupled to the network 109. The client device 106 may comprise, for example, a processor-based system such as a computer system. Such a computer system may be embodied in the form of a desktop computer, a laptop computer, a personal digital assistant, a cellular telephone, a set-top box, a music player, a video player, a media player, a web pad, a tablet computer system, a game console, or other devices with like capability.

The client device 106 may be configured to execute various applications such as a data store client application 127 and other applications. The data store client application 127 may be executed in a client device 106, for example, to facilitate interaction with the data store management application 118. In one embodiment, the data store client application 127 may be configured, for example, to access and render network pages, such as web pages, or other network content served up by the computing device 103, a web server, a page server, or other servers, for the purpose of interfacing with the data store management application 118. The client device 106 may be configured to execute applications beyond the data store client application 127 such as, for example, browser applications, email applications, instant message applications, and/or other applications.

In various embodiments, the data store client application 127 may comprise a thin client application, a thick client application, or another type of client application. Some embodiments may include a graphical user interface and/or a command-line interface. In some embodiments, the client device 106 can be configured to interact with a distributed data store provided by the computing devices 103a . . . 103N via an application programming interface (API) provided by the data store management application 118 executed in a master data store or slave data store.

A data item update request 121 is generated by a data store client application 127. Although the data store client application 127 is described as executed in a client device 106, it is understood that the client device 106 may correspond to a server computer that processes business logic, generates network pages, and/or performs other tasks. Thus, although a user may generate a data item update request 121 through a user interface, a data item update request 121 may also be generated automatically by business logic applications, workflow engines, network page generation applications, and/or other applications.

The data item update request 121 may correspond to a portion of another application, such as, for example, a module, a library, etc., in various embodiments. The data item update request 121 may be sent over the network 109 to the data store management application 118 using hypertext transfer protocol (HTTP), simple object access protocol (SOAP), remote procedure call (RPC), remote method invocation (RMI), representational state transfer (REST), Windows Communication Foundation, and/or other frameworks and protocols. In various embodiments, the data item update request 121 may describe updates to data items by using, for example, structured query language (SQL), extensible markup language (XML), JavaScript object notation (JSON), yet another markup language (YAML), and/or other formats.

Turning now to FIG. 2, shown is another view of the networked environment 100 (FIG. 1). Where FIG. 1 focused on the structure of the components, FIG. 2 focuses on how the computing devices 103a . . . 103N are distributed among physical locations. The computing devices 103a . . . 103N may be referred to herein as nodes 103 or replicated nodes. Together, the nodes 103 function as a distributed data store 200. Each computing device 103 resides at a particular physical location, and these locations can be grouped into availability zones. A collection of computing devices 103 which all reside at the same physical location (e.g., building, campus, etc.) is commonly referred to as a "data center." The example networked environment 100 of FIG. 2 includes three data centers 203a, 203b, 203c. Availability zones and/or data centers are geographically separated to some degree, but the degree of separation may vary. That is, availability zones and/or data centers can be distributed across a town, across a city, across a country, across the world, etc. Such distribution provides for greater stability of a distributed data store 200, so that a catastrophic event occurring in one location, which may affect a subset of the nodes 103 in the distributed data store 200, does not jeopardize the system as a whole.

As noted above, at any point in time one node 103 acts as a master and the other nodes 103 act as slaves. In the example networked environment 100 of FIG. 2, node 103m is the master node while nodes 103a, 103b, and 103c are slave nodes. The master node 103m is located at data center 203a, as is slave node 103a. Slave node 103b is located at data center 203b, and slave node 103c is located at data center 203c. It should be appreciated that a networked environment 100 can include any number of data centers, a data center 203 can include any number of nodes, and the master node can reside at any data center 203.

An overview of the operation of the distributed data store 200 will now be provided. A data store client application 127 executing on a client device 106 generates a data item update request 121. The data item update request 121 is received by the master node 103m. The master node 103m sends a data item replication request 206 to the slave nodes 103a, 103b, and 103c. The data item replication request 206 may be an actual replica of the originally received data item update request 121, a separate request including some or all of the information in the originally received data item update request 121, or other variations as should be appreciated.

After processing the data item replication request 206, the slave nodes 103a, 103b, and 103c each send a data item replication acknowledgement 209 back to the master node 103m. After receiving a predefined quorum of acknowledgements 209, the master node 103m responds to the data store client application 127 with a data item update confirmation 124. The quorum required to send out this data item update confirmation 124 is referred to herein as a durability quorum. In some embodiments, the durability quorum is a locality-based durability quorum, as described in the co-owned and co-pending patent application "Locality Based Quorums" having U.S. application Ser. No. 12/967,187, which is hereby incorporated by reference in its entirety.
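
For illustration, the following is a minimal sketch of how a master might test a locality-based durability quorum, using the definition given below in connection with FIG. 5 (at least one node residing in each of the data centers). The function name and the node-to-data-center mapping are hypothetical, not the patent's implementation.

```python
# Sketch only: acknowledgements must cover at least one node in every
# data center before the update confirmation 124 is sent.
def durability_quorum_met(acked_nodes, node_to_dc, data_centers):
    """Return True once acknowledgements 209 cover every data center."""
    acked_dcs = {node_to_dc[n] for n in acked_nodes}
    return acked_dcs >= set(data_centers)

# Example: master 103m in 203a, slaves 103a (203a), 103b (203b), 103c (203c).
node_to_dc = {"103m": "203a", "103a": "203a", "103b": "203b", "103c": "203c"}
dcs = ["203a", "203b", "203c"]
print(durability_quorum_met({"103m", "103b"}, node_to_dc, dcs))          # False
print(durability_quorum_met({"103m", "103b", "103c"}, node_to_dc, dcs))  # True
```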

The distributed data store 200 includes features which facilitate recovery upon failure of the master node 103m. A failure can be represented by a hardware failure of some kind, an abnormal termination of the data store management application 118, and/or other failure as can be appreciated. Upon such a failure, the remaining computing devices 103 executing an instance of the data store management application 118 can elect a new master node by employing a consensus algorithm. In some embodiments, the data store management application 118 executed in the various computing devices 103 can be configured to collectively employ a Paxos election scheme in order to determine the identity of the computing device 103 that will serve as the master. The election of a master among the various computing devices 103 in the distributed data store 200 can also be determined by other methods of reaching consensus in a distributed system of peers as can be appreciated. The quorum required in the election of a new master is a locality-based failover quorum, described below in connection with FIG. 3.

Referring next to FIG. 3, shown is a flowchart that provides one example of the operation of a portion of the data store management application 118 (FIG. 1) according to various embodiments. In particular, the flowchart of FIG. 3 illustrates aspects of a failover process for a distributed data store. It is understood that the flowchart of FIG. 3 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the portion of the data store management application 118 as described herein. As an alternative, the flowchart of FIG. 3 may be viewed as depicting an example of steps of a method implemented in the computing device 103 (FIG. 1) according to one or more embodiments.

Beginning at box 303, the data store management application 118 (FIG. 1) receives an indication that the master node 103m has failed. The indication may take the form of a timeout, a message from the master node, a message from another node, or any other suitable implementation. Next, at box 306, a new master candidate is determined by an election among nodes 103 other than the failed master node 103m. The election employs a consensus algorithm as described above. Before acting as the master (e.g., before receiving data item update requests 121 from clients), at box 309 the data store management application 118 on the newly elected master node 103 waits for consensus among a locality-based failover quorum that excludes any node that is failover-quorum-ineligible. As used herein, a locality-based failover quorum is defined as participation from all nodes 103 in N-K+1 data centers 203, where N is the number of data centers 203 and K is the durability requirement. However, nodes that are known to be failover-quorum-ineligible are ignored by the data store management application 118 when seeking this quorum. Without this feature, the wait for the failover quorum could block for an arbitrary amount of time waiting for a particular node to take part in the quorum. This situation might occur, for example, due to a node having failed or having been taken down temporarily for maintenance.
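
A minimal sketch of the box 309 quorum test follows, assuming the definition above: full participation from the eligible nodes in at least N-K+1 data centers, with failover-quorum-ineligible nodes ignored. The data structures, and the treatment of a data center whose nodes are all ineligible, are assumptions; the patent does not prescribe an implementation.

```python
# Sketch only: count data centers whose eligible nodes have all taken
# part in the failover consensus, and require N-K+1 of them.
def failover_quorum_met(participants, ineligible, nodes_by_dc, k):
    n = len(nodes_by_dc)                 # N: number of data centers
    required_dcs = n - k + 1             # N-K+1 data centers must be complete
    complete = 0
    for dc, nodes in nodes_by_dc.items():
        eligible = set(nodes) - ineligible   # ineligible nodes are ignored
        # Assumption: a data center whose nodes are all ineligible is
        # treated as fully participating, since it can hold no
        # undiscovered acknowledged updates.
        if eligible <= participants:
            complete += 1
    return complete >= required_dcs

nodes_by_dc = {"203a": ["103m", "103a"], "203b": ["103b"], "203c": ["103c"]}
# N=3, K=2: full participation from eligible nodes in 2 data centers
# suffices. 203a is incomplete (failed master 103m cannot participate),
# but 203b participates fully and 203c is ignored while 103c is ineligible.
print(failover_quorum_met({"103a", "103b"}, {"103c"}, nodes_by_dc, k=2))  # True
```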

Having seen full participation from the appropriate quorum of data centers 203 at box 309, the newly elected master is guaranteed to know about all of the updates 121 that have gained locality-based durability. In some embodiments, the newly elected master ensures that all data discovered during the wait for consensus (box 309) is locality-based durable before transitioning to master. Locality-based durability is described in the co-owned and co-pending patent application "Locality Based Quorums" having U.S. application Ser. No. 12/967,187. At some point after consensus is reached in box 309, the data store management application 118 executing on the newly elected master node 103 transitions at box 312 from a new master candidate to the master and thus receives data item update requests 121 from clients. The process of FIG. 3 is then complete.

As described above, the locality-based quorum used during failover excludes any node that is failover-quorum-ineligible. Several methods of transitioning between failover-quorum-non-eligibility mode and failover-quorum-eligibility mode will be described in more detail in connection with FIGS. 4, 5, and 6. Another method involves a system operator sending out an eligible or non-eligible message on behalf of a node that is, for example, unresponsive.

Turning now to FIG. 4, shown is a flowchart that provides additional description of node ineligibility for the failover quorum from FIG. 3, according to various embodiments. It is understood that the flowchart of FIG. 4 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the portion of the data store management application 118 (FIG. 1) as described herein. As an alternative, the flowchart of FIG. 4 may be viewed as depicting an example of steps of a method implemented in the computing device 103 (FIG. 1) according to one or more embodiments.

Beginning at box 403, the data store management application 118 (FIG. 1) receives an event indicating that the node should enter failover-quorum-non-eligibility mode. The event may occur during a graceful shutdown, may occur as a result of system operator intervention, or may be any other suitable trigger. Upon learning that the node should no longer participate in the failover quorum, at box 406 the data store management application 118 stops sending acknowledgements (209 in FIG. 2) for data item replication requests 206 (FIG. 2) from the master node 103m (FIG. 1). As explained above, outside of failover-quorum-non-eligibility mode the node does receive, process, and then acknowledge these data item replication requests 206 from the master node 103m. However, as long as the node is in failover-quorum-non-eligibility mode, no such acknowledgements 209 will occur. At box 409, after ceasing acknowledgements 209, the data store management application 118 notifies the other nodes in the distributed data store 200 (FIG. 2) that this node is not eligible for the failover quorum. The process of FIG. 4 is then complete.
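
The ordering of boxes 406 and 409 can be sketched as follows; the class and method names are illustrative assumptions. The essential point is that acknowledgements cease before the notification is broadcast.

```python
# Sketch only: enter failover-quorum-non-eligibility mode in the order
# FIG. 4 prescribes: stop acking (box 406), then notify peers (box 409).
class QuorumEligibility:
    def __init__(self, broadcast):
        self.broadcast = broadcast   # e.g., a context-ordered group delivery send
        self.acking = True           # acknowledging replication requests 206?
        self.eligible = True         # counted in the failover quorum?

    def enter_non_eligibility_mode(self):
        # Box 406: cease acknowledgements 209 to the master first.
        self.acking = False
        self.eligible = False
        # Box 409: only after ceasing acks, notify the other nodes.
        self.broadcast("failover-quorum-non-eligible")

# Usage with a stand-in broadcast function:
node = QuorumEligibility(broadcast=lambda msg: print("notify peers:", msg))
node.enter_non_eligibility_mode()
```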

The data item replication requests 206, data item replication acknowledgements 209, and notifications (box 409) all utilize the context-ordered group delivery service 119 (FIG. 1). The delivery service 119 guarantees that every node processes a particular message by first processing every other message which the particular message depends on. In other words, the message context graph is always a directed acyclical graph.

The use of context-ordered group delivery affects the processing of FIG. 4 as follows. When the node sends, at box 409, the message indicating entry into failover-quorum-non-eligibility mode, every node that processes this message is guaranteed (by virtue of context-ordered message delivery) to know that the sending node (the one in non-eligibility mode) will no longer be acknowledging updates from the master. This in turn means that the node in non-eligibility mode does not have any updates that need to be discovered during failover. Therefore, the other nodes can discount the non-eligible node in the failover quorum.

Moving on to FIG. 5, shown is a flowchart that provides additional description of node eligibility for the failover quorum according to various embodiments. It is understood that the flowchart of FIG. 5 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the portion of the data store management application 118 (FIG. 1) as described herein. As an alternative, the flowchart of FIG. 5 may be viewed as depicting an example of steps of a method implemented in the computing device 103 (FIG. 1) according to one or more embodiments.

Beginning at box 503, the data store management application 118 (FIG. 1) receives an event indicating that the node should enter failover-quorum-eligibility mode. The event may occur during a graceful recovery after shutdown, may occur as a result of system operator intervention, or may be any other suitable trigger. Upon learning that the node should return to participation in the failover quorum, at box 506 the data store management application 118 notifies the other nodes in the distributed data store 200 (FIG. 2) that this node is once again eligible for the failover quorum.

After notification to the other nodes, at box 509 the data store management application 118 waits for acknowledgement of this notification from a locality-based durability super quorum. As defined herein, a locality-based durability quorum includes at least one node residing in each of the data centers. A locality-based durability super quorum then excludes the node which is entering failover-quorum-eligibility mode (i.e., the node that sent the notification in box 506). This super quorum ensures that the node re-entering failover-quorum-eligibility mode will itself be part of the next successful failover quorum.
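
A minimal sketch of the super-quorum test at box 509, under the definitions above, follows; the function and parameter names are hypothetical.

```python
# Sketch only: acknowledgements must cover at least one node in every
# data center, not counting the node that is re-entering eligibility.
def super_quorum_met(ack_nodes, reentering_node, node_to_dc, data_centers):
    counted = {node_to_dc[n] for n in ack_nodes if n != reentering_node}
    return counted >= set(data_centers)

# Example: node 103a is re-entering, so its own acknowledgement must not
# count toward coverage of data center 203a.
node_to_dc = {"103m": "203a", "103a": "203a", "103b": "203b", "103c": "203c"}
dcs = ["203a", "203b", "203c"]
print(super_quorum_met({"103a", "103b", "103c"}, "103a", node_to_dc, dcs))  # False
print(super_quorum_met({"103m", "103b", "103c"}, "103a", node_to_dc, dcs))  # True
```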

Next, at box 512, the data store management application 118 resumes sending acknowledgements (209 in FIG. 2) for data item replication requests 206 (FIG. 2) from the master node 103m (FIG. 1). As explained above, as long as the node is in failover-quorum-non-eligibility mode, no such acknowledgements 209 will occur. However, the node has now transitioned to failover-quorum-eligibility mode and so acknowledges updates as normal. The process of FIG. 5 is then complete. Like the process of FIG. 4, the process of FIG. 5 also utilizes the context-ordered group delivery service 119 for data item replication requests 206, data item replication acknowledgements 209, notifications (box 506), and notification acknowledgements (box 509).

Referring now to FIG. 6, shown is a flowchart that describes transitioning between failover quorum non-eligibility and eligibility, according to various embodiments. It is understood that the flowchart of FIG. 6 provides merely an example of the many different types of functional arrangements that may be employed to implement the operation of the portion of the data store management application 118 (FIG. 1) as described herein. As an alternative, the flowchart of FIG. 6 may be viewed as depicting an example of steps of a method implemented in the computing device 103 (FIG. 1) according to one or more embodiments.

Beginning at box 603, the node executing the data store management application 118 (FIG. 1) is restarted. This restart may occur, for example, after a graceful shutdown or an unexpected shutdown. Next, at box 606, the data store management application 118 enters failover-quorum-non-eligibility mode, described above in connection with FIG. 4. At box 609, at some point after entering failover-quorum-non-eligibility mode following the restart, the data store management application 118 prepares for full participation as a member of the distributed data store 200 by beginning to process data item replication requests 206 received from the master node 103m.

Next, at box 612, the data store management application 118 determines whether the processed replication requests have caught up with the other nodes in the distributed data store 200. To do so, the data store management application 118 may use services provided by the context-ordered group delivery service 119 to determine how many replication requests have been processed by the other nodes in the distributed data store 200, and compare this number to the number of replicated writes that the data store management application 118 has itself processed.

If at box 612 it is determined that the data store management application 118 has not caught up with the other nodes, then the data store management application 118 returns to box 609, where further replication requests are processed. If, however, at box 612 it is determined that the data store management application 118 has caught up with the other nodes, then the data store management application 118 moves to box 616. At box 616, the data store management application 118 enters failover-quorum-eligibility mode, described above in connection with FIG. 5. The process of FIG. 6 is then complete. Like the processes of FIG. 4 and FIG. 5, the process of FIG. 6 also utilizes the context-ordered group delivery service 119 for data item replication requests 206 and data item replication acknowledgements 209.
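
The box 612 comparison might be sketched as follows, assuming the context-ordered group delivery service 119 can report per-node counts of processed replication requests; the counts and function name are illustrative.

```python
# Sketch only: the catch-up decision of box 612, comparing replicated
# writes processed locally against the counts reported for the peers.
def caught_up(own_processed_count, peer_processed_counts):
    """True when this node has processed at least as many replicated
    writes as every other node in the distributed data store."""
    return own_processed_count >= max(peer_processed_counts)

# Box 609/612 loop: keep processing replication requests until caught up,
# then enter failover-quorum-eligibility mode (box 616, per FIG. 5).
print(caught_up(41, [44, 44, 43]))  # False: keep processing (box 609)
print(caught_up(44, [44, 44, 43]))  # True: enter eligibility mode (box 616)
```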

Moving on to FIG. 7, shown is a schematic block diagram of the computing device 103 according to an embodiment of the present disclosure. The computing device 103 includes at least one processor circuit, for example, having a processor 703 and a memory 706, both of which are coupled to a local interface 709. To this end, the computing device 103 may comprise, for example, at least one server computer or like device. The local interface 709 may comprise, for example, a data bus with an accompanying address/control bus or other bus structure as can be appreciated.

Stored in the memory 706 are both data and several components that are executable by the processor 703. In particular, stored in the memory 706 and executable by the processor 703 are the data store management application 118, the context-ordered group delivery service 119, and potentially other applications. Also stored in the memory 706 may be a data store 112 and other data. In addition, an operating system may be stored in the memory 706 and executable by the processor 703. While not illustrated, the client device 106 also includes components like those shown in FIG. 7, whereby the data store client application 127 is stored in a memory and executable by a processor.

It is understood that there may be other applications that are stored in the memory 706 and are executable by the processor 703 as can be appreciated. Where any component discussed herein is implemented in the form of software, any one of a number of programming languages may be employed such as, for example, C, C++, C#, Objective-C, Java, JavaScript, Perl, PHP, Visual Basic, Python, Ruby, Delphi, Flash, or other programming languages.

A number of software components are stored in the memory 706 and are executable by the processor 703. In this respect, the term "executable" means a program file that is in a form that can ultimately be run by the processor 703. Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 706 and run by the processor 703, source code that may be expressed in proper format such as object code that is capable of being loaded into a random access portion of the memory 706 and executed by the processor 703, or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 706 to be executed by the processor 703, etc. An executable program may be stored in any portion or component of the memory 706 including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components.

The memory 706 is defined herein as including both volatile and nonvolatile memory and data storage components. Volatile components are those that do not retain data values upon loss of power. Nonvolatile components are those that retain data upon a loss of power. Thus, the memory 706 may comprise, for example, random access memory (RAM), read-only memory (ROM), hard disk drives, solid-state drives, USB flash drives, memory cards accessed via a memory card reader, floppy disks accessed via an associated floppy disk drive, optical discs accessed via an optical disc drive, magnetic tapes accessed via an appropriate tape drive, and/or other memory components, or a combination of any two or more of these memory components. In addition, the RAM may comprise, for example, static random access memory (SRAM), dynamic random access memory (DRAM), or magnetic random access memory (MRAM) and other such devices. The ROM may comprise, for example, a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other like memory device.

Also, the processor 703 may represent multiple processors, and the memory 706 may represent multiple memories that operate in parallel processing circuits, respectively. In such a case, the local interface 709 may be an appropriate network 109 (FIG. 1) that facilitates communication between any two of the multiple processors 703, between any processor 703 and any of the memories 706, or between any two of the memories 706, etc. The local interface 709 may comprise additional systems designed to coordinate this communication, including, for example, performing load balancing. The processor 703 may be of electrical or of some other available construction.

Although the data store management application 118, the context-ordered group delivery service 119, and other various systems described herein may be embodied in software or code executed by general-purpose hardware as discussed above, as an alternative the same may also be embodied in dedicated hardware or a combination of software/general-purpose hardware and dedicated hardware. If embodied in dedicated hardware, each can be implemented as a circuit or state machine that employs any one of or a combination of a number of technologies. These technologies may include, but are not limited to, discrete logic circuits having logic gates for implementing various logic functions upon an application of one or more data signals, application-specific integrated circuits having appropriate logic gates, or other components, etc. Such technologies are generally well known by those skilled in the art and, consequently, are not described in detail herein.

The flowcharts of FIGS. 3, 4, 5, and 6 show the functionality and operation of an implementation of portions of the data store management application 118. If embodied in software, each block may represent a module, segment, or portion of code that comprises program instructions to implement the specified logical function(s). The program instructions may be embodied in the form of source code that comprises human-readable statements written in a programming language or machine code that comprises numerical instructions recognizable by a suitable execution system such as a processor 703 in a computer system or other system. The machine code may be converted from the source code, etc. If embodied in hardware, each block may represent a circuit or a number of interconnected circuits to implement the specified logical function(s).

Although the flowcharts of FIGS. 3, 4, 5, and 6 show a specific order of execution, it is understood that the order of execution may differ from that which is depicted. For example, the order of execution of two or more blocks may be scrambled relative to the order shown. Also, two or more blocks shown in succession in FIGS. 3, 4, 5, and 6 may be executed concurrently or with partial concurrence. Further, in some embodiments, one or more of the blocks shown in FIGS. 3, 4, 5, and 6 may be skipped or omitted. In addition, any number of counters, state variables, warning semaphores, or messages might be added to the logical flow described herein, for purposes of enhanced utility, accounting, performance measurement, or providing troubleshooting aids, etc. It is understood that all such variations are within the scope of the present disclosure.

Also, any logic or application described herein, including the data store management application 118 and the context-ordered group delivery service 119, that comprises software or code can be embodied in any non-transitory computer-readable medium for use by or in connection with an instruction execution system such as, for example, a processor 703 in a computer system or other system. In this sense, the logic may comprise, for example, statements including instructions and declarations that can be fetched from the computer-readable medium and executed by the instruction execution system. In the context of the present disclosure, a "computer-readable medium" can be any medium that can contain, store, or maintain the logic or application described herein for use by or in connection with the instruction execution system. The computer-readable medium can comprise any one of many physical media such as, for example, magnetic, optical, or semiconductor media. More specific examples of a suitable computer-readable medium would include, but are not limited to, magnetic tapes, magnetic floppy diskettes, magnetic hard drives, memory cards, solid-state drives, USB flash drives, or optical discs. Also, the computer-readable medium may be a random access memory (RAM) including, for example, static random access memory (SRAM) and dynamic random access memory (DRAM), or magnetic random access memory (MRAM). In addition, the computer-readable medium may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other type of memory device.

It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

Therefore, the following is claimed:
1. A non-transitory computer-readable medium embodying a program executable in a computing device, the program, when executed, causing the computing device to at least: transition a respective one of a plurality of slave nodes to a failover quorum non-eligibility mode responsive to receiving a first transition event; cease to acknowledge, upon transition to the failover quorum non-eligibility mode, any replication requests received from a master node, the master node and the plurality of slave nodes forming a distributed data store; perform, after ceasing to acknowledge the replication requests, a first notification to notify the master node and at least one of a remainder of the plurality of slave nodes in the distributed data store that the failover quorum non-eligibility mode has been entered by the respective one of the plurality of slave nodes, the remainder of the plurality of slave nodes excluding the respective one of the plurality of slave nodes; transition from the failover quorum non-eligibility mode to a failover quorum eligibility mode responsive to receiving a second transition event; perform, upon transition to the failover quorum eligibility mode, a second notification to notify the at least one of the remainder of the plurality of slave nodes in the distributed data store that the failover quorum eligibility mode has been entered by the respective one of the plurality of slave nodes; wait, after the second notification that the failover quorum eligibility mode has been entered by the respective one of the plurality of slave nodes, for acknowledgment of the second notification from a locality-based durability super quorum which excludes the respective one of the plurality of slave nodes entering the failover quorum eligibility mode; resume, after waiting for the acknowledgment of the second notification, acknowledgement of the replication requests received from the master node; wherein individual ones of the plurality of slave nodes in the distributed data store reside at a respective data center within a plurality of data centers, the locality-based durability super quorum being made of at least one of the plurality of slave nodes residing in respective ones of the plurality of data centers but excluding the respective one of the plurality of slave nodes entering the failover quorum eligibility mode.
2. The non-transitory computer-readable medium of claim 1, wherein the first transition event is a message received as part of a graceful shutdown.
3. The non-transitory computer-readable medium of claim 1, wherein the first transition event is a message originating from a system operator.
4. A method of distributing data items, comprising: determining a new master candidate through an election among a plurality of slave nodes forming a distributed data store; waiting to perform a failover from a failed master to the new master candidate until a consensus is reached among a locality-based failover quorum of the plurality of slave nodes that excludes any of the plurality of slave nodes that are in a failover quorum non-eligibility mode; wherein individual ones of the plurality of slave nodes in the distributed data store reside at a respective data center within a plurality of data centers, a locality-based durability super quorum being made of at least one of the plurality of slave nodes residing in respective ones of the plurality of data centers but excluding a respective one of the plurality of slave nodes entering the failover quorum eligibility mode; entering the respective one of the plurality of slave nodes into the failover quorum non-eligibility mode by first ceasing to acknowledge any replication requests received from a master node and then performing a first notification to notify at least one of a remainder of the plurality of slave nodes in the distributed data store of entry into the failover quorum non-eligibility mode, the remainder of the plurality of slave nodes excluding the respective one of the plurality of slave nodes; and entering the respective one of the plurality of slave nodes into a failover quorum eligibility mode by first performing a second notification to notify the at least one of the remainder of the plurality of slave nodes in the distributed data store of entry into the failover quorum eligibility mode, then waiting for acknowledgment of the second notification from the locality-based durability super quorum, and then resuming acknowledgement of the replication requests received from the master node.
5. The method of claim 4, wherein the locality-based failover quorum includes N-K+1 of the plurality of data centers but excludes any of the plurality of slave nodes that are in the failover quorum non-eligibility mode, wherein N is a size of the plurality of data centers and K is a durability requirement.
6. The method of claim 4, wherein the first notification, second notification, and acknowledging all utilize a context-ordered group delivery message service.
7. The method of claim 4, wherein the first notification, second notification, and acknowledging all utilize a context-ordered group message delivery service in which the message delivery is described by a directed acyclical graph.
8. The method of claim 4, further comprising: while in the failover quorum non-eligibility mode, processing the replication requests from the master node; and transitioning from the failover quorum non-eligibility mode to the failover quorum eligibility mode when the processed replication requests have caught up with the master node and the plurality of slave nodes in the distributed data store.
9. The method of claim 4, further comprising sending a message, on behalf of another respective one of the plurality of slave nodes in the distributed data store, indicating that the other respective one of the plurality of slave nodes has entered the failover quorum non-eligibility mode.
10. The method of claim 4, wherein ceasing to acknowledge any replication requests received from the master node is responsive to receiving a first transition event.
11. The method of claim 4, wherein transitioning from the failover quorum non-eligibility mode to the failover quorum eligibility mode occurs responsive to a message received from a system operator.
12. A system for distributing data items, comprising: at least one computing device configured to at least: cease to acknowledge any replication requests received from a master node, the master node and a plurality of slave nodes forming a distributed data store; perform, after ceasing to acknowledge the replication requests, a first notification to notify the master node and at least one of a remainder of the plurality of slave nodes in the distributed data store that a failover quorum non-eligibility mode has been entered by a respective one of the plurality of slave nodes, the remainder of the plurality of slave nodes excluding the respective one of the plurality of slave nodes; perform, upon a transition from the failover quorum non-eligibility mode to a failover quorum eligibility mode, a second notification to notify the master node and the at least one of the remainder of the plurality of slave nodes in the distributed data store that the failover quorum eligibility mode has been entered by the respective one of the plurality of slave nodes; wait, after the second notification that the failover quorum eligibility mode has been entered, for acknowledgment of the second notification from a locality-based durability super quorum; resume, after waiting for the acknowledgment of the second notification, acknowledgment of the replication requests received from the master node; and wherein individual ones of the plurality of slave nodes in the distributed data store reside at a respective data center within a plurality of data centers, the locality-based durability super quorum being made of at least one of the plurality of slave nodes residing in respective ones of the plurality of data centers but excluding the respective one of the plurality of slave nodes entering the failover quorum eligibility mode.
13. The system of claim 12, wherein the first notification, second notification, and acknowledging all utilize a context-ordered group delivery message service.
14. The system of claim 12, wherein the first notification, second notification, and acknowledging all utilize a context-ordered group message delivery service in which the message delivery is described by a directed acyclical graph.
15. The system of claim 12, wherein the at least one computing device is further configured to at least: process replication requests from the master node while in the failover quorum non-eligibility mode; determine when the processed replication requests have caught up with the master node and the plurality of slave nodes in the distributed data store; and wherein the transition from the failover quorum non-eligibility mode to the failover quorum eligibility mode occurs responsive to the determination.
16. The system of claim 12, wherein ceasing to acknowledge any replication requests received from the master node is responsive to receiving a first transition event.
17. The system of claim 12, wherein ceasing to acknowledge any replication requests received from the master node is responsive to a message received as part of a graceful shutdown.
18. The system of claim 12, wherein the transition from the failover quorum non-eligibility mode to the failover quorum eligibility mode occurs responsive to a message received from a system operator.
19. The system of claim 12, wherein the transition from the failover quorum non-eligibility mode to the failover quorum eligibility mode occurs responsive to receiving a second transition event.